Risk-Sensitive Planning with One-Switch Utility Functions: Value Iteration

Authors

  • Yaxin Liu
  • Sven Koenig
Abstract

Decision-theoretic planning with nonlinear utility functions is important since decision makers are often risk-sensitive in high-stake planning situations. One-switch utility functions are an important class of nonlinear utility functions that can model decision makers whose decisions change with their wealth level. We study how to maximize the expected utility of a Markov decision problem for a given one-switch utility function, which is difficult since the resulting planning problem is not decomposable. We first study an approach that augments the states of the Markov decision problem with the wealth level. The properties of the resulting infinite Markov decision problem then allow us to generalize the standard risk-neutral version of value iteration from manipulating values to manipulating functions that map wealth levels to values. We use a probabilistic blocks-world example to demonstrate that the resulting risk-sensitive version of value iteration is practical.

Introduction

Utility theory (von Neumann & Morgenstern, 1944) is a normative theory of decision making under uncertainty. It states that every rational decision maker who accepts a small number of axioms has a strictly monotonically increasing utility function that transforms their wealth level w into a utility U(w), so that they always choose the course of action that maximizes their expected utility. The utility function models their risk attitudes. A decision maker is risk-neutral if their utility function is linear, risk-averse if their utility function is concave, and risk-seeking if their utility function is convex. Decision-theoretic planning with nonlinear utility functions is important since decision makers are often risk-sensitive in high-stake planning situations (= planning situations with the possibility of large wins or losses) and their risk attitude affects their decisions. For example, some decision makers buy insurance in business decision situations and some do not. Furthermore, their decisions often change with their wealth level. In particular, they are often risk-averse but become risk-neutral in the limit as their wealth level increases. One-switch utility functions are an important class of nonlinear utility functions that can model such decision makers.

We model probabilistic planning problems as fully observable Goal-Directed Markov Decision Problems (GDMDPs) and investigate how to maximize their expected utility for a given one-switch utility function, which is difficult since the resulting planning problem is not decomposable. The optimal course of action now depends not only on the current state of the GDMDP but also on the wealth level (= accumulated rewards). Thus, we first study an approach that transforms a risk-sensitive GDMDP into a risk-neutral one, basically by augmenting the states of the risk-sensitive GDMDP with the possible wealth levels. The resulting risk-neutral GDMDP has an infinite state space, but its properties allow us to generalize the standard risk-neutral version of value iteration, which manipulates values (one for each state), to a risk-sensitive version of value iteration, which manipulates functions (one for each state) that map wealth levels to values. We use a probabilistic blocks-world example to demonstrate that the resulting risk-sensitive version of value iteration is practical.
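To make the wealth dependence concrete, the following small Python sketch uses a linear-plus-exponential utility U(w) = w − D·γ^w with D > 0 and 0 < γ < 1, a common one-switch form. The constants and the two lotteries are illustrative choices, not values from the paper: the exponential term makes the function concave everywhere, but it vanishes as w grows, so the preferred lottery can switch as the wealth level increases.

```python
# A small illustration of a one-switch (linear plus exponential) utility
# function.  The constants D and GAMMA and the two lotteries below are
# illustrative choices, not values from the paper.

D, GAMMA = 10.0, 0.9

def one_switch_utility(w):
    # U(w) = w - D * GAMMA**w is concave (risk-averse) everywhere, but the
    # exponential term vanishes as w grows, so behaviour approaches risk
    # neutrality at high wealth levels.
    return w - D * GAMMA ** w

def expected_utility(lottery, wealth, U=one_switch_utility):
    # Expected utility of a lottery, given as (probability, payoff) pairs,
    # for a decision maker who already holds the given wealth.
    return sum(p * U(wealth + x) for p, x in lottery)

sure_thing = [(1.0, 5.0)]                 # a certain payoff of 5
gamble     = [(0.5, 0.0), (0.5, 11.0)]    # a riskier lottery with a higher mean (5.5)

for wealth in (0.0, 10.0, 20.0):
    prefers = ("sure thing"
               if expected_utility(sure_thing, wealth) > expected_utility(gamble, wealth)
               else "gamble")
    print(f"wealth {wealth:5.1f}: prefers the {prefers}")
```

With these particular constants, the printed preference switches from the sure payoff at wealth 0 to the higher-mean gamble at higher wealth levels, the kind of single switch that gives the utility class its name.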
Our research is intended to be a first step toward better probabilistic planners for high-stake planning situations such as environmental crisis situations (Blythe, 1997), business decision situations (Goodwin, Akkiraju, & Wu, 2002), and planning situations in space (Zilberstein et al., 2002).

GDMDPs

We model probabilistic planning problems as finite Goal-Directed Markov Decision Problems (GDMDPs), which are characterized by a finite set of states S, a finite set of goal states G ⊆ S, and a finite set of actions A that can be executed in all non-goal states s ∈ S \ G. The decision maker always chooses which action a ∈ A to execute in their current non-goal state s ∈ S \ G. Its execution results with probability P(s′|s, a) in finite (immediate) reward r(s, a, s′) < 0 and a transition to state s′ ∈ S in the next time step. The decision maker stops acting when they reach a goal state s ∈ G, which is modeled as them executing a dummy action whose execution results with probability 1.0 in reward 0.0 and leaves their current goal state unchanged.

H_t denotes the set of all histories at time step t ≥ 0. A history at time step t is any sequence h_t = (s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t) ∈ (S × A)^t × S of states and actions from the state at time step 0 to the current state at time step t that can occur with positive probability if the decision maker executes the corresponding actions in sequence. The (planning) horizon of a decision maker is the number of time steps 1 ≤ T ≤ ∞ that they plan for. A trajectory is an element of H_T.

Decision-theoretic planners determine policies, where a policy π consists of a decision rule d_t for every time step 0 ≤ t < T within the horizon. A decision rule determines which action the decision maker should execute in their current state. The most general policies are those that consist of potentially different decision rules for the time steps, where every decision rule is a mapping from histories at the current time step to probability distributions over actions; such decision rules are called randomized history-dependent (HR) decision rules. We denote the class of such policies as Π^HR. More restricted policies consist of the same decision rule for every time step, where the decision rule is a mapping from only the current state to actions; such decision rules are called deterministic stationary (SD) decision rules. We denote the class of such policies as Π^SD.

Consider a decision maker with an arbitrary utility function U. If the horizon T is finite and the decision maker starts in initial state s ∈ S, then the expected utility of their total reward under policy π ∈ Π^HR is the expected value of U applied to the total reward accumulated along the resulting trajectory.
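Written out, and assuming the standard notation (E^π_s for the expectation over trajectories generated by following π from s; the symbols below are ours, not necessarily the paper's), this objective is

```latex
v^{\pi}_{U,T}(s) \;=\; E^{\pi}_{s}\!\left[\, U\!\Bigl(\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1})\Bigr) \right].
```

Because the only information about the past that this objective needs is the reward accumulated so far, augmenting each state with the current wealth level makes the problem Markovian again, at the price of an infinite state space. The following Python sketch runs the resulting backup on a toy GDMDP with the wealth levels truncated to a coarse integer grid; the states, actions, rewards, and utility constants are all illustrative, and the paper's algorithm manipulates the wealth-to-value functions exactly rather than on a discretized grid.

```python
# A minimal sketch of risk-sensitive value iteration on a wealth-augmented toy
# GDMDP.  All names, transition probabilities, rewards, and utility constants
# are illustrative choices, not taken from the paper.

D, GAMMA = 2.0, 0.9

def U(w):
    # Illustrative one-switch (linear plus exponential) utility.
    return w - D * GAMMA ** w

# Toy GDMDP with one non-goal state 's' and one goal state 'g'.
# 'safe'  : reward -3, reaches the goal for sure.
# 'risky' : reward -1, reaches the goal with probability 0.35, else stays in 's'.
P = {
    ('s', 'safe'):  [(1.0, 'g', -3)],
    ('s', 'risky'): [(0.35, 'g', -1), (0.65, 's', -1)],
}
STATES, GOALS, ACTIONS = {'s', 'g'}, {'g'}, ['safe', 'risky']
WEALTH_GRID = range(-60, 1)     # wealth levels considered (all rewards are negative)

# One value function per state, mapping wealth levels to values; start from U.
v = {s: {w: U(w) for w in WEALTH_GRID} for s in STATES}

def q_value(s, a, w, v):
    # Expected value of executing a in s at wealth w and continuing with v.
    # Wealth levels that fall off the grid fall back to U, a crude boundary rule.
    return sum(p * v[s1].get(w + r, U(w + r)) for p, s1, r in P[(s, a)])

def backup(v):
    # One sweep of the functional backup:
    # v(s)(w) <- max_a sum_{s'} P(s'|s,a) * v(s')(w + r(s,a,s')).
    new_v = {g: dict(v[g]) for g in GOALS}          # goal values stay at U(w)
    for s in STATES - GOALS:
        new_v[s] = {w: max(q_value(s, a, w, v) for a in ACTIONS)
                    for w in WEALTH_GRID}
    return new_v

for _ in range(40):                                 # enough sweeps for this toy example
    v = backup(v)

# The greedy action in 's' now depends on the accumulated wealth.
for w in (0, -5, -20):
    best = max(ACTIONS, key=lambda a: q_value('s', a, w, v))
    print(f"wealth {w:4d}: best action in 's' is {best}")
```

With a one-switch utility, the greedy action in the non-goal state can differ between high and low wealth levels, which is precisely the wealth dependence that the state augmentation is designed to capture.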


Similar articles

An exact algorithm for solving MDPs under risk-sensitive planning objectives with one-switch utility functions

One-switch utility functions are an important class of nonlinear utility functions that can model human beings whose decisions change with their wealth level. We study how to maximize the expected utility for Markov decision problems with given one-switch utility functions. We first utilize the fact that one-switch utility functions are weighted sums of linear and exponential utility functions ...


Functional Value Iteration for Decision-Theoretic Planning with General Utility Functions

We study how to find plans that maximize the expected total utility for a given MDP, a planning objective that is important for decision making in high-stakes domains. The optimal actions can now depend on the total reward that has been accumulated so far in addition to the current state. We extend our previous work on functional value iteration from one-switch utility functions to all utility ...


Risk-sensitive planning in partially observable environments

Partially Observable Markov Decision Process (POMDP) is a popular framework for planning under uncertainty in partially observable domains. Yet, the POMDP model is risk-neutral in that it assumes that the agent is maximizing the expected reward of its actions. In contrast, in domains like financial planning, it is often required that the agent decisions are risk-sensitive (maximize the utility o...


Probabilistic Planning with Risk-Sensitive Criterion

Probabilistic planning models and, in particular, Markov Decision Processes (MDPs), Partially Observable Markov Decision Processes (POMDPs) and Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) have been extensively used by AI and Decision Theoretic communities for planning under uncertainty. Typically, the solvers for probabilistic planning models find policies that min...



Publication year: 2005